ABSTRACT

Having a precise vulnerability discovery model (VDM) would 
provide a useful quantitative insight to assess software secu- 
rity. Thus far, several models have been proposed with some 
evidence supporting their goodness-of-fit. 

In this work we describe an independent validation of the 
applicability of six existing VDMs in seventeen releases of 
the three popular browsers Firefox, Google Chrome and In- 
ternet Explorer. We have collected five different kinds of 
data sets based on different definitions of a vulnerability. 
We introduce two quantitative metrics, goodness-of-fit
entropy and goodness-of-fit quality, to analyze the impact of
vulnerability data sets on the stability as well as the quality
of VDMs over the software life cycle.

The experimental results show that the "confirmed-by-vendors'
advisories" data sets apparently yield more stable and better
results for VDMs, and that the s-shaped logistic model (AML)
seems to perform best overall. Meanwhile, the Anderson
thermodynamic model (AT) is not suitable for modeling the
vulnerability discovery process. This suggests that the
discovery processes of vulnerabilities and of normal bugs are
different, because people are more interested in finding
security vulnerabilities than in finding normal programming
bugs.

Categories and Subject Descriptors 

H.4 [Information Systems Applications]: Miscellaneous 

General Terms 

Security, Vulnerability, Discovery Models Validation 

1. INTRODUCTION 

The vulnerability discovery process normally refers to the
post-release stage in which people identify and report security
flaws of released software. Vulnerability discovery models
(VDMs) operate on the known vulnerability data to estimate
the total number of vulnerabilities present in the software.
Successful models can give useful hints to both software
vendors and users when allocating resources to handle potential
breaches and to schedule patch updates. For example, we do
not know exactly the day of major snowfalls, but cities expect
snow to fall in winter and therefore plan resources for road
clearing in that period. Effective planning is important
because security bugs are different from "normal" bugs. A
normal bug might be filed and scheduled for fixing in
the next release, whereas a security vulnerability might
require an urgent patch to be shipped to customers lest
their browser be subject to rogue campaigns. Major shifts
in browser usage are often attributed to (real or perceived)
"more" security. Understanding the security trend is
therefore important.

In this paper we consider six proposed VDMs. The first
model is Anderson's Thermodynamic (AT) [5]. Rescorla
proposed two other models [17]: Quadratic (RQ) and
Exponential (RE). The fourth model considered here is Alhazmi &
Malaiya's Logistic (AML) model [2]. The fifth is directly
derived from a software reliability model: Logistic Poisson
(LP) (a.k.a. the Musa-Okumoto model). The last model is the
simple linear model (LN).

Among these models, the AML model has been subject
to significant experimental validation: from operating
systems (i.e., Windows NT/95/98/2K/XP, RedHat 6.2/7.1
and Fedora) to browsers [21] (i.e., IE, Firefox, Mozilla)
and web servers [20] (i.e., IIS, Apache). The
results reported in the literature show that there is not
enough evidence either to reject or to accept AML. Three
browsers were considered: AML is strongly accepted for one
(Mozilla), strongly rejected for one (IE), and the outcome
for the third is unknown (Firefox).

These inconsistent results may be caused by a combination
of factors. First, the authors did not clearly state
what a vulnerability is. For example, the National
Vulnerability Database (NVD) reports vulnerabilities which the
vendors' security bulletins do not classify as such.

The second problem is that the authors considered all
versions of a software product as a single application, and
counted vulnerabilities for this "application". Massacci et
al. [14] have shown that each Firefox version has its own code
base, which may differ by 30% or more from the immediately
preceding one. Therefore, as time goes by, we can no longer
claim that we are counting the vulnerabilities of the same
application. To illustrate this problem visually, Figure 1
shows in one plot the cumulative vulnerabilities of the
different versions of Chrome, where we restart the counters for
each version. It is immediately apparent that there is not a
single "trend" but a "firework" effect in which each version
determines its own trajectory.

The figure shows the cumulative vulnerabilities reported for
six releases of Chrome (Chrome 1.0 to 6.0) by the number
of months since release. The different trends of different
releases suggest that a different discovery model should be
applied for each release.

Figure 1: Google Chrome firework of vulnerability
discovery trends

1.1 Contribution of this Paper 

This paper presents an independent validation experiment
on the goodness-of-fit of six existing VDMs against the three
most popular browsers: Firefox, Google Chrome and Internet
Explorer.

We also analyze the impact of vulnerability data sets, based
on different definitions of vulnerability, on the VDMs'
performance. The contributions of this paper are as follows.

• We introduce two quantitative metrics, namely
goodness-of-fit entropy and goodness-of-fit quality, to assess
the stability and quality of the goodness-of-fit of a VDM
on a given data set.

• We show that one model (AT) does not work at all.
Reliability models do not seem to apply, which is
empirical confirmation that security is essentially
different from reliability. Among the six analyzed models,
AML seems to be superior in terms of goodness-of-fit
quality.

• The definition of vulnerability does indeed impact the
conclusion of a VDM study. If one sticks to
a-vulnerability-as-an-NVD (e.g., NVD, NVD.Advice in our study)
as the main source for counting, NVD entries confirmed by
vendors' advisories yield more stable results than raw NVD.

• We found that long-lived, evolving software may have
more than one saturation period, in which the number
of discovered vulnerabilities increases slowly, after which
it continues to increase linearly. This is probably the effect
of code inheritance, i.e., a large number of lines of code
in the new code base is inherited from old ones.

The rest of the paper is organized as follows. In the
subsequent section we present the related work (§2). Then we
describe our research questions and how we answer them
(§3). Next we briefly discuss existing VDMs and
their formulae (§4). Then we present how we collect the
vulnerability data sets used for the validation (§5).
After that, we discuss the methodology of the experiment
and its results (§6). Next, we discuss the impact of data
sets on the goodness-of-fit of VDMs (§7). We then study the
evolution of VDMs' goodness-of-fit (§8) and the quality of
VDMs (§9) over the software life cycle. After a discussion of
potential threats to the validity of our work (§10), we
conclude the paper (§11).



2. RELATED WORK 



Anderson [5] discussed the trade-off in security between open
source and closed source systems. On one side, 'to many eyes,
all bugs are shallow', but on the other side, 'potential hackers
have also had the opportunity to study the software closely
to determine its vulnerabilities'. In this work, he proposed
a VDM (a.k.a. Anderson Thermodynamic, AT) based on
reliability growth models, in which the probability of a
security failure at time t, when n bugs have been removed,
is inversely proportional to t for alpha testers. This
probability is even lower for beta testers, by a factor of λ
compared to alpha testers. However, he did not conduct any
experiment to validate the proposed model.

In other work on vulnerability discovery by white hats
(security researchers) and black hats (hackers), Rescorla
[17] discussed many shortcomings of NVD, but his study
relies heavily on it nonetheless. Rescorla proposed two
mathematical models, called the Linear model (a.k.a. Rescorla
Quadratic, RQ) and the Exponential model (a.k.a. Rescorla
Exponential, RE). He performed an experiment on four versions
of different operating systems (i.e., Windows NT 4.0, Solaris
2.5.1, FreeBSD 4.0 and RedHat 7.0). In all of the cases, the
two models were able to fit the data, with p-values ranging
from 0.167 to 0.589. In fact, we could not find any significant
difference between these models in terms of goodness-of-fit
when running a Wilcoxon test on their reported results
(p-value > 0.05).

Alhazmi and Malaiya [2] proposed another VDM inspired
by the s-shaped logistic model, called Alhazmi Malaiya Logistic
(AML). The underlying idea is to divide the discovery process
into three phases: learning phase, linear phase and
saturation phase. In the first phase, people need some time to
study the software, so few vulnerabilities are discovered. In
the second phase, when people have gained deeper knowledge of
the software, many more vulnerabilities are found. In the
final phase, since the software is going out of date, few
people still use it and people lose interest in finding new
vulnerabilities, so the cumulative number of vulnerabilities
stabilizes. In this work, the authors validated their proposal
against several versions of Windows (i.e., Win 95/98/NT4.0/2K)
and Linux (i.e., RedHat Linux 6.1, 7.1). Their model fitted Win
95 very well (p-value = 0.999991), as well as Win NT4.0
(p-value = 0.923). For the other versions, the p-value ranged
from 0.054 to 0.317.

Alhazmi and Malaiya [3] compared their proposed model
with Rescorla's (2005) models (RE, RQ) and Anderson's (2002)
model (AT) on Windows 95/XP, RedHat Linux 6.2 and Fedora. The
results show that their logistic model has a better
goodness-of-fit than the others. For Windows 95 and Linux 6.2,
as the vulnerabilities are distributed along s-shape-like
curves, only AML is able to fit the data (p-value = 1), whereas
all other models fail to match the data (p-value < 0.05). For
Windows XP, the story is different: RQ turns out to be the best
one with p-value = 0.97, while AML matches the data poorly
(p-value = 0.147).

Woo et al. [21] carried out an experiment on three browsers:
IE, Firefox and Mozilla. However, it is unclear which
versions of these browsers were analyzed. We speculate that
they did not distinguish between versions. This could have
a significant impact on their final results, as we show later
in the paper. In their experiment, IE was not fitted,
Firefox was fairly well fitted, and Mozilla was well fitted.
From these results, we cannot conclude anything about the
performance of AML.

In another experiment, Woo et al. [20] validated AML
against two web servers: Apache and IIS. Again, they did not
distinguish between versions of Apache and IIS. In this
experiment, AML demonstrated a very good performance
on the vulnerability data of these web servers (p-value = 1).

3. RESEARCH QUESTIONS 

The primary question is "does a model fit the observed
data?". When a new VDM is proposed, its authors typically
run some experiments to validate its applicability, and the
proposed VDM usually shows good goodness-of-fit in their
reports. As time goes by, the goodness-of-fit may improve or
deteriorate as more data become available (either more data
points for the same software, or new software to be considered
as an instance). This motivates our first research question:

RQ1 Are existing VDMs able to fit cumulative numbers of
vulnerabilities of the popular browsers (i.e., IE, Firefox,
and Chrome)?

To find the answer, we encountered another major and
almost foundational issue: "what is a vulnerability?". Most
related work did not explicitly discuss this question.
Normally, a vulnerability is a security report describing a
particular problem of a particular application, for instance: a
report in Mozilla Foundation Security Advisories (a.k.a. an
MFSA entry), or an NVD report of NIST (NVD entry). In
the wisdom of many people, an NVD entry is a vulnerabil- 
ity, but there are many other definitions [6) [T] [51 (9] 1111 [18] . 
This raises the second research question in our study. 

RQ2 How do different definitions of vulnerability impact the
VDMs' goodness-of-fit?

Figure 2 illustrates the vulnerability space of Firefox, in
which different 'kinds' of Firefox vulnerabilities coexist
at different levels of abstraction.

• Mozilla Bugzilla: contains very technical reports for
vulnerabilities, but also for other, normal programming bugs.
Bugzilla entries, called bugs, are visualized as black
circles in the figure.

• NVD: holds high-level third-party security reports for
several applications, including Firefox. Many NVD
entries (gray ovals) mentioning Firefox maintain
references to Bugzilla (black circles inside ovals).

• MFSA: is the set of the vendor's high-level security reports
for Mozilla's products. Each MFSA entry (rounded
rectangle) always references one or more bugs (black
circles inside) responsible for the security flaw. MFSA
also holds links to the corresponding NVD entries
(overlapping ovals).

This illustrates different abstraction levels of vulnerability:
from developer level (Bugzilla) to user level (MFSA, NVD). A
Bugzilla entry denotes a technical programming issue (both
security and non-security ones). Security Bugzilla entries are
ones reported in an MFSA entry, or referenced by an NVD entry.

Figure 2: The vulnerability space of Firefox.

Depending on the judgement of analysts, different numbers of
vulnerabilities are observed and collected. Here, in Figure 2,
if we define a vulnerability as an MFSA entry, an NVD entry, or
a Bugzilla entry, these numbers are respectively six, ten and
fourteen. So which is the actual number of vulnerabilities?
This is also the target of our third research question.

RQ3 Among vulnerability definitions, which is the most
appropriate, i.e., the one for which VDMs yield the most
stable results?

In the fourth research question, we address the fact that
the fitness of a model might evolve over time: a model
might be good at some times, but deteriorate later.
Therefore, this research question focuses on the fitness of a
VDM over the lifetime of software products.

RQ4 Among existing VDMs, which one is globally
superior?

To work out these issues, we collected the
vulnerability-as-an-NVD data set for the three popular
browsers. Then we fitted existing VDMs to the observed data,
and checked how well they fit (RQ1). Next, we collected other
data sets with respect to other definitions, and fitted the
VDMs to these data sets (RQ2). We estimated the entropy of
goodness-of-fit for each data set to learn on which data set
VDMs yield more stable results; this estimate is used to
compare data sets (RQ3). Finally, we ranked VDMs based on their
goodness-of-fit during the lifetime of the software (RQ4).

4. VULNERABILITY DISCOVERY MODELS 

This section provides a quick overview of the six VDMs. As
noted in [3], these VDMs are the main representatives of
vulnerability discovery models. Here, only the formulae of
these six models are discussed. The detailed rationale of the
models as well as the meaning of each parameter can be found
in the original work or in [3]. All these parameters are
estimated using non-linear regression on observed data.

Table 1: Formal definitions of six VDMs in the
study.

This table lists the existing VDMs considered, in alphabetical
order. The rationale of the formulae and the meaning of each
parameter can be found in the original work of each model. All
parameters are estimated based on non-linear regression on
observed data.

Model                              Formula
Alhazmi-Malaiya Logistic (AML)     $\Omega(t) = \frac{B}{BCe^{-ABt} + 1}$
Anderson Thermodynamic (AT)        $\Omega(t) = \frac{k}{\gamma}\ln(t) + C$
Linear (LN)                        $\Omega(t) = At + B$
Logistic Poisson (LP)              $\Omega(t) = \beta_0 \ln(1 + \beta_1 t)$
Rescorla Exponential (RE)          $\Omega(t) = N(1 - e^{-\lambda t})$
Rescorla Quadratic (RQ)            $\Omega(t) = \frac{At^2}{2} + Bt$

Table 2: Data sets collected for major releases of IE,
Firefox and Chrome.

Bullets (•) indicate enabled data sets. Dashes (-), otherwise,
mean there are no data sources available to collect the data
sets.

• Alhazmi-Malaiya Logistic (AML): proposed by Alhazmi
& Malaiya [2], inspired by the s-shaped curve. The
underlying assumption is that the vulnerability discovery
process is divided into three phases. Learning phase: the
software has just been released and people need time to
learn it, so vulnerabilities are detected slowly. Linear
phase: people have become acquainted with the software, and
vulnerabilities are discovered rapidly. Saturation phase:
the software becomes stable (or people move to new
software), and fewer vulnerabilities are discovered.

• Anderson Thermodynamic (AT): the application of this
model to vulnerabilities was proposed by Anderson [5].
The author assumed that finding a vulnerability (or
bug) becomes much harder as time goes by, because the
reliability of the software increases. The term
thermodynamic originates from an analogy with
thermodynamics; in the formula, γ accounts for the lower
failure rate during beta testing compared to the higher
rates during alpha testing.



• Linear model (LN): this is the simplest model, and is well
known. A linear model is often used to express the trend
line of data.

• Logistic Poisson (LP): originates from the field of
reliability engineering, and is also known as the
Musa-Okumoto model [15]. The idea of the model is that "the
failure intensity would decrease exponentially with the
expected number of failures experienced" [15]. In the
formula, β0 represents the total number of faults that
would eventually be detected, and β1 is the per-fault hazard
rate for the exponential model [12].

• Rescorla Exponential (RE): this model is the work of
Rescorla [17], inspired by the Goel-Okumoto model [10]
in software reliability engineering, in which the
reliability is increasing. The number of vulnerabilities
discovered in a single product is assumed to follow a Poisson
process. In the formula, N is the eventual total number of
vulnerabilities, and λ is the rate of discovery.

• Rescorla Quadratic (RQ): is also proposed in [17] while
attempting to identify trends in vulnerability discovery
using statistical tests. The rationale is that the
vulnerability finding rate varies linearly with time. The
cumulative number of vulnerabilities is thus represented as
a quadratic curve.
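
As a minimal sketch (an illustration only, not the code used in the experiment), the six formulae of Table 1 can be written as plain Python functions. The function names and all parameter values below are hypothetical, and the AT coefficient is assumed to be k/γ, as the model is usually stated in [3].

```python
import math

# Cumulative number of vulnerabilities Omega(t) for each VDM of Table 1.
# t is the time since release; parameter names follow Table 1.
# Function names are hypothetical, chosen for this sketch only.

def vdm_aml(t, A, B, C):
    """Alhazmi-Malaiya Logistic: s-shaped curve saturating at B."""
    return B / (B * C * math.exp(-A * B * t) + 1.0)

def vdm_at(t, k, gamma, C):
    """Anderson Thermodynamic (assumed form k/gamma * ln(t) + C); needs t > 0."""
    return (k / gamma) * math.log(t) + C

def vdm_ln(t, A, B):
    """Linear model."""
    return A * t + B

def vdm_lp(t, beta0, beta1):
    """Logistic Poisson (Musa-Okumoto)."""
    return beta0 * math.log(1.0 + beta1 * t)

def vdm_re(t, N, lam):
    """Rescorla Exponential: saturates at N, discovery rate lam."""
    return N * (1.0 - math.exp(-lam * t))

def vdm_rq(t, A, B):
    """Rescorla Quadratic: discovery rate A*t + B integrated over time."""
    return A * t * t / 2.0 + B * t

if __name__ == "__main__":
    # Made-up parameters, only to show the shape of each curve.
    for month in (1, 12, 36, 60):
        print(month,
              round(vdm_aml(month, 0.002, 100.0, 0.5), 2),
              round(vdm_re(month, 80.0, 0.05), 2),
              round(vdm_rq(month, 0.1, 1.0), 2))
```

With these definitions, AML and RE level off (at B and N respectively) while LN and RQ grow without bound, which is the difference in hypothesized shape that the goodness-of-fit test probes.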



5. DATA COLLECTION 

Vulnerability information for the three browsers IE,
Firefox, and Chrome is available from multiple sources, from
multi-vendor sources like the National Vulnerability Database
(NVD) to vendors' advisories and bug trackers, e.g., Mozilla
Foundation Security Advisories (MFSA), Mozilla Bugzilla, or the
Chrome Issues Tracker. To evaluate the impact of vulnerability
definitions on the goodness-of-fit of VDMs, we collect
different data sets with respect to different definitions.

• NVD(X): a vulnerability is an NVD entry that mentions
version X (i.e., version X appears in the vulnerable
configuration section of this NVD entry).

• NVD.Bug(X): a vulnerability is an NVD entry that
mentions version X and has one or more links in the
references section to a bug report of the software vendor.

• NVD.Advice(X): a vulnerability is an NVD entry that
mentions version X and has one or more links in the
references section to a security advisory report of the
software vendor (the advisory report might not mention
version X, but only some later version).

• NVD.NBug(X): a vulnerability is a bug report of the
vendor that also appears in the references section of an
NVD entry that mentions version X.

• Advice.NBug(X): a vulnerability is a bug report of the
software vendor that is also mentioned in an advisory
report of the software vendor, where this advisory report
has one or more links to an NVD entry that mentions
version X. One exception is that many advisory reports of
Firefox v1.0 have no link to NVD; we considered all bugs
mentioned in these advisory reports as vulnerabilities for
v1.0 even without NVD links.
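
The five definitions above can be read as filters over NVD entries and advisory reports. The following is a minimal sketch under stated assumptions: the record layout (keys `id`, `versions`, `bug_refs`, `advisory_refs`, `nvd_refs`), the function names, and the sample identifiers are all hypothetical, and the Firefox v1.0 exception is not modeled.

```python
# Hypothetical record layout (not an NVD or MFSA schema):
#   NVD entry: {"id", "versions": set, "bug_refs": list, "advisory_refs": list}
#   advisory:  {"id", "nvd_refs": list, "bug_refs": list}

def nvd(entries, x):
    """NVD(X): NVD entries that mention version X."""
    return {e["id"] for e in entries if x in e["versions"]}

def nvd_bug(entries, x):
    """NVD.Bug(X): NVD(X) entries with at least one vendor bug link."""
    return {e["id"] for e in entries if x in e["versions"] and e["bug_refs"]}

def nvd_advice(entries, x):
    """NVD.Advice(X): NVD(X) entries with at least one vendor advisory link."""
    return {e["id"] for e in entries
            if x in e["versions"] and e["advisory_refs"]}

def nvd_nbug(entries, x):
    """NVD.NBug(X): bug reports referenced by NVD entries mentioning X."""
    return {b for e in entries if x in e["versions"] for b in e["bug_refs"]}

def advice_nbug(entries, advisories, x):
    """Advice.NBug(X): bugs in advisories linked to an NVD(X) entry."""
    affected = nvd(entries, x)
    return {b for a in advisories
            if affected & set(a["nvd_refs"]) for b in a["bug_refs"]}

# Made-up sample data, for illustration only.
ENTRIES = [
    {"id": "CVE-A", "versions": {"3.0", "3.5"},
     "bug_refs": ["bug1", "bug2"], "advisory_refs": ["adv1"]},
    {"id": "CVE-B", "versions": {"3.0"}, "bug_refs": [], "advisory_refs": []},
    {"id": "CVE-C", "versions": {"3.5"},
     "bug_refs": ["bug3"], "advisory_refs": []},
]
ADVISORIES = [
    {"id": "adv1", "nvd_refs": ["CVE-A"],
     "bug_refs": ["bug1", "bug2", "bug4"]},
]

if __name__ == "__main__":
    for name, result in [
        ("NVD", nvd(ENTRIES, "3.0")),
        ("NVD.Bug", nvd_bug(ENTRIES, "3.0")),
        ("NVD.Advice", nvd_advice(ENTRIES, "3.0")),
        ("NVD.NBug", nvd_nbug(ENTRIES, "3.0")),
        ("Advice.NBug", advice_nbug(ENTRIES, ADVISORIES, "3.0")),
    ]:
        print(name, len(result))
```

Even on this toy input the five definitions give different counts for the same release, which is the effect the data sets are designed to expose.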

For NVD.NBug(X) and Advice.NBug(X), we do not
know the releases that a bug might impact, so we assume that
a bug impacts all configurations mentioned in the NVD entries
referenced by this bug. However, not all bugs explicitly
reference NVD. In this case, we apply a bug-to-NVD linking
scheme which includes the following rules:

• if a bug is listed in the references of an NVD entry, this
bug is linked to this NVD entry.

• if a bug and an NVD entry are clustered in an advisory report
(e.g., MFSA), this bug is considered to be linked to this
NVD entry.
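
The two linking rules can be sketched as follows. This is an illustration under assumptions, not the authors' tooling: the record layout, function names and sample identifiers are hypothetical.

```python
def link_bugs_to_nvd(entries, advisories):
    """Return a mapping bug id -> set of linked NVD ids (hypothetical layout)."""
    links = {}
    # Rule 1: a bug listed in the references of an NVD entry is linked to it.
    for entry in entries:
        for bug in entry["bug_refs"]:
            links.setdefault(bug, set()).add(entry["id"])
    # Rule 2: a bug and an NVD entry clustered in the same advisory are linked.
    for advisory in advisories:
        for bug in advisory["bug_refs"]:
            links.setdefault(bug, set()).update(advisory["nvd_refs"])
    return links

def affected_versions(bug, links, entries):
    """A bug is assumed to impact every version of its linked NVD entries."""
    by_id = {entry["id"]: entry for entry in entries}
    versions = set()
    for nvd_id in links.get(bug, ()):
        if nvd_id in by_id:
            versions |= by_id[nvd_id]["versions"]
    return versions

# Made-up sample data, for illustration only.
SAMPLE_ENTRIES = [
    {"id": "CVE-A", "versions": {"3.0", "3.5"}, "bug_refs": ["bug1"]},
    {"id": "CVE-C", "versions": {"3.5"}, "bug_refs": ["bug3"]},
]
SAMPLE_ADVISORIES = [
    {"id": "adv1", "nvd_refs": ["CVE-A"], "bug_refs": ["bug1", "bug4"]},
]

if __name__ == "__main__":
    links = link_bugs_to_nvd(SAMPLE_ENTRIES, SAMPLE_ADVISORIES)
    for bug in sorted(links):
        print(bug, sorted(links[bug]),
              sorted(affected_versions(bug, links, SAMPLE_ENTRIES)))
```

In this sample, `bug4` never appears in an NVD reference list, yet rule 2 links it to `CVE-A` through the shared advisory, so it inherits the versions of that entry.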
This figure illustrates the goodness-of-fit features reported in Table 3. The circles indicate cumulative vulnerabilities at a certain time. The horizontal axis (X)
is time-in-market, measured by the number of months since the official release. The vertical axis (Y) is the cumulative number of vulnerabilities. Top 3
indicates the order of VDMs sorted by p-values. The label next to the VDM's name in the legend shows the goodness-of-fit of this VDM.

Figure 3: VDMs' goodness-of-fit on browsers in NVD data set.



We finally collected 58 data sets, and we used these data
sets to run the VDM experiment on 17 major releases of the
three browsers. The details of which data sets are available
for which releases are reported in Table 2.

6. VALIDATION OF VDM 
6.1 Validation Methodology 

The steps for validating VDMs are quite straightforward.
We first observe the data, which here are the monthly
cumulative numbers of vulnerabilities since the release date.
Thus far, all these models have mostly been validated under the
vulnerability-is-an-NVD assumption, which corresponds to our
NVD data set; their data sets are collections of published NVD
entries. To make our experiment comparable, we also use
this definition of vulnerability, and run the goodness-of-fit
experiment on the NVD data set. Besides, NVD is the only
data set common to the three browsers (see Table 2).
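
The observed series can be built as in the sketch below. The bucketing by calendar month, the choice to ignore entries dated before the release, the function names and the sample dates are assumptions made for illustration; the text only states that the data are monthly cumulative counts since the release date.

```python
from datetime import date

def months_since(release, day):
    """Whole calendar months between the release month and the given day."""
    return (day.year - release.year) * 12 + (day.month - release.month)

def cumulative_by_month(release, disclosures, horizon):
    """Cumulative vulnerability counts for months 0..horizon after release."""
    counts = [0] * (horizon + 1)
    for day in disclosures:
        month = months_since(release, day)
        if 0 <= month <= horizon:  # assumption: skip out-of-range entries
            counts[month] += 1
    cumulative, total = [], 0
    for count in counts:
        total += count
        cumulative.append(total)
    return cumulative

if __name__ == "__main__":
    # Made-up release and disclosure dates.
    release = date(2020, 1, 15)
    disclosures = [date(2020, 1, 20), date(2020, 3, 1),
                   date(2020, 3, 30), date(2020, 6, 2)]
    print(cumulative_by_month(release, disclosures, 6))
```

Each element of the returned list is one observed value O_i of the series that the VDMs are fitted to.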

Second, we fit the VDMs to all data points of the
observed data using the R [16] tool. Finally, the expected
values of each model are computed for the goodness-of-fit test.
We employ the chi-square (χ²) goodness-of-fit test for this
purpose. This test is based on the χ² statistic, calculated as
follows.



$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \qquad (1)$$



O_i and E_i respectively denote the observed values and the
expected values generated by the VDMs. The smaller χ² is, the
better the goodness-of-fit of a VDM. In practice,
a VDM is acceptably fitted if χ² is less than a critical
value, given a significance level (α) and the degrees of
freedom. The p-value here represents the significance of the
differences between observed values and expected values. If the
p-value is small, the differences are significant and not due
to chance. Thus, the smaller the p-value, the stronger the
evidence that a VDM does not fit the data. Hence, we interpret
the goodness-of-fit based on the ranges of the p-value as
follows.



• Not Fit: p-value ∈ [0, 0.05), the difference is
significant, not by chance. This evidence is strong enough
to reject the model.

• Good Fit: p-value ∈ [0.95, 1.0], the difference, in
contrast to the previous case, is significantly small. It is
strong evidence to accept the model.

• Inconclusive Fit: p-value ∈ [0.05, 0.95), there is not
enough evidence either to reject or to accept the model.
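
The test and its interpretation can be sketched as below. This is an illustration, not the R procedure used in the experiment: the p-value is computed from a series expansion of the regularized lower incomplete gamma function (adequate for moderate statistics only), the function names are hypothetical, and the degrees of freedom in the example (number of data points minus one minus the number of fitted parameters) are an assumption, since the text does not state how they were chosen.

```python
import math

def chi_square_statistic(observed, expected):
    """Equation (1): sum of (O_i - E_i)^2 / E_i over all data points."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p_value(stat, dof):
    """Upper-tail probability of a chi-square variable with dof degrees."""
    if stat <= 0.0:
        return 1.0
    a, z = dof / 2.0, stat / 2.0
    # Series for the lower incomplete gamma function gamma(a, z).
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))

def interpret(p_value):
    """Map a p-value to the three ranges defined above."""
    if p_value < 0.05:
        return "Not Fit"
    if p_value >= 0.95:
        return "Good Fit"
    return "Inconclusive Fit"

if __name__ == "__main__":
    # Made-up observed series and made-up expected values from a fitted model.
    observed = [1, 2, 4, 7, 9, 12, 14, 15, 16, 16]
    expected = [1.2, 2.3, 4.1, 6.6, 9.4, 11.9, 13.7, 14.9, 15.6, 16.0]
    n_params = 3  # e.g. AML has three parameters (A, B, C)
    stat = chi_square_statistic(observed, expected)
    p = chi_square_p_value(stat, len(observed) - 1 - n_params)
    print(round(stat, 4), interpret(p))
```

Note that this is the opposite of the usual reading of a significance test: a high p-value counts as evidence for the model, so only values of 0.95 and above are labeled a good fit.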

6.2 Result and Discussion 

We ran the goodness-of-fit experiment for six VDMs on
seventeen releases using the NVD data set. The experiment
generates 102 curves (and lines), so it is impossible to show
them all. Figure 3, for illustrative purposes, only shows
charts that highlight features of our results.

Among the analyzed releases, many are old, having been
shipped to users many years ago, and many have only
recently been released. This diversity provides a good
picture of the behavior of VDMs in different periods of an
application's life. Vulnerabilities of old releases are
intuitively more stable than those of younger ones. Hence,
a good VDM should be able to capture the vulnerability
distribution of old releases.

To this end, Figure 3 shows the fitted plots of VDMs
for selected releases, i.e., IE6, IE7, IE8, Firefox v1.0 and
v3.6, and Chrome 4, using the NVD data set. We choose the NVD
data set to make our results comparable with others. We select
these releases since they are representative of the two
aforementioned groups of applications: old releases (i.e., IE6,
Firefox v1.0, and IE7), and young releases (i.e., IE8, Firefox
v3.6, and Chrome 4). In this figure, the cumulative numbers
of vulnerabilities are shown as empty circles, and fitted
VDMs are visualized by lines with different patterns. We
analyzed six VDMs, but in this figure we only show the
top three VDMs, i.e., those with better results than the others
in terms of p-value.

The vulnerability distributions of IE6 and IE7 are still
increasing in a nearly linear manner. There are two possible
reasons. First, people are still interested in these two
browsers since they are shipped with Windows XP, which
has a noticeable number of users; thus people keep searching
for vulnerabilities in these browsers. Second, a large part
of the code base of IE6 and IE7 is inherited by later releases
(i.e., IE8, IE9), so many vulnerabilities discovered later
in IE8 and IE9 originate from earlier releases (i.e.,
IE6, IE7).

This data distribution could explain the goodness-of-fit of
VDMs which support linear modeling. Thus, for IE7, the LP
model fitted the data very well, and the RQ and LN models might
fit the data. Other models (not shown here) did not fit the
data well because their hypothesized shapes are not
appropriate. Meanwhile, even though the increase of IE6's
vulnerabilities seems to be linear, the variance of the numbers
of vulnerabilities around the perfect linear model falsifies
most VDMs, except AML, since the observed data form a
stretched S-shape.

The chart of Firefox v1.0 shows a different phenomenon,
called after-life vulnerabilities, in which many vulnerabilities
are discovered after a release is out of official support [14].
Vulnerabilities of Firefox v1.0 were discovered linearly in the
first 20 months of its lifetime, and their number then remained
mostly constant in the next 20 months. However, this number has
been increasing since then. We speculate that when Firefox v1.0
was released it attracted many attacks, but later on people
lost interest in finding new vulnerabilities in this
release. The number of vulnerabilities then increased again
because a large portion of the code of Firefox v1.0 is still
alive in modern releases of Firefox [14], and many
vulnerabilities reported later also apply to this very first
release. This kind of distribution challenges all analyzed VDMs
since none of them takes this phenomenon into account, and
hence they all fail for Firefox v1.0.

In the bottom part of Figure 3, all the releases (IE8,
Firefox v3.6, and Chrome 4) are still young. Thus the
distributions of vulnerabilities for these releases are linear,
and many VDMs that support linear modeling can fit the data.

To give an overall picture of the performance of VDMs
on the NVD data set, Table 3 reports the goodness-of-fit for
the 102 curves of all releases. Here, instead of reporting a
big table of numbers, Table 3 shows the interpretation of the
p-values of the χ² tests. This presentation also helps to study
the results at a higher level of abstraction than the raw
p-values. In this table, there are 47 cases in which VDMs
either fit the data well or fit inconclusively, and 55 cases in
which they do not fit. Roughly speaking, the chance of a fit is
about 50%. Looking at each VDM individually, the AML model
appears to be the best one as it obtains more positive results
than the others. In contrast, the AT model seems to be the
worst because it could only fit one release (IE v7.0).
Meanwhile, the other models are equivalent in the number of
times they are rejected and accepted, except the LP model,
which is likely a bit better.

As a conclusion of this section, fitting VDMs to the NVD
data set gives a hint that the assumption behind the AML
model is somewhat appropriate for the observed data. This idea
apparently captures the way people discover vulnerabilities
in practice. Also, in the first months of a software's
lifetime, its vulnerabilities increase linearly. Hence, any
model that supports linear modeling can fit the observed data.
However, depending on the shapes of the models, sometimes one
model is better than another. We can hardly say which one is
better than the others, except for a more confident conclusion
about the performance of the AT model, which is very poor in
almost all cases. The assumption of the AT model is not
applicable to vulnerability detection.

Table 3: The goodness-of-fit of VDMs using data set
NVD.

The goodness-of-fit of a VDM is based on the p-value in the χ² test.
p-value < 0.05: not fit (-), p-value ≥ 0.95: good fit (X), and
inconclusive fit (?) otherwise.

However, since the goodness-of-fit of VDMs might change
over time, in a subsequent section we will study the evolution
of VDMs' goodness-of-fit with respect to the software lifetime
in order to gain better insight.

7. THE IMPACT OF DATA SETS 

Figure 4 displays the notched box plots of the observed
cumulative vulnerabilities by browser release and data set.
Non-overlapping notches between boxes indicate a
statistically significant difference between their medians. The
distributions of vulnerabilities in the data sets indeed
reflect the way they are collected. If we take the NVD data set
as a baseline, NVD.Bug and NVD.Advice are subsets of NVD that
only select NVD entries which have one or more confirmed links
to a vendor's bug report or security advisory, respectively.
Thus, the numbers of vulnerabilities in NVD.Bug and
NVD.Advice are smaller than in NVD.

Meanwhile, NVD.Advice and NVD.Bug look quite similar,
because many NVD entries which have links to vendors'
security advisories also have links to vendors' bug reports.
This is also confirmed by a statistical test: the
Fligner-Killeen test of homogeneity of variances shows that
NVD.Bug and NVD.Advice are quite homogeneous (p-value = 0.996).

NVD.NBug and Advice.NBug, respectively, count the numbers
of bugs in NVD entries and in vendors' security advisories.
So, NVD.NBug and Advice.NBug are basically multiples of
NVD.Bug and NVD.Advice. Since a vendor's security advisory
entry, in the case of Firefox, often has more links to bug
reports than an NVD entry does, the number of vulnerabilities
in Advice.NBug is larger than that in NVD.NBug.

In general, the non-overlapping notches (except for NVD.Bug
and NVD.Advice) show a statistically significant difference
between the medians of these five data sets. This gives a hint
that different conclusions might be drawn from these data sets.
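
The notch comparison can be sketched as follows. The notch extent used here, median ± 1.57 × IQR / √n, is a common plotting convention and is an assumption about how the figure was drawn; the function names and the sample values are likewise hypothetical.

```python
import math
import statistics

def notch_interval(values, factor=1.57):
    """Approximate notch around the median: median +/- factor * IQR / sqrt(n)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    half_width = factor * (q3 - q1) / math.sqrt(len(values))
    median = statistics.median(values)
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True if the notch intervals of the two samples intersect."""
    low_a, high_a = notch_interval(a)
    low_b, high_b = notch_interval(b)
    return low_a <= high_b and low_b <= high_a

if __name__ == "__main__":
    # Made-up cumulative counts for two data sets of one release.
    small = [10, 12, 14, 16, 18, 20, 22, 24]
    large = [40, 44, 48, 52, 56, 60, 64, 68]
    print(notch_interval(small), notch_interval(large))
    print("overlap:", notches_overlap(small, large))
```

When the two intervals do not intersect, the medians are read as significantly different, which is how the box plots are interpreted in this section.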

To better understand how data sets affect the performance of
VDMs, we compare the distributions of p-values generated by
fitting each data set to all VDMs. To make the comparison more
precise, we try to use as many data points as possible. However,
since some data sets are not available for some browsers (see
Table 2), we can only compare data sets within the browsers that
they cover. In particular, we compare all five data sets on
Firefox's vulnerabilities because all five provide data for
Firefox. For Chrome and Firefox, we can only compare
Figure 4: The box plots of the cumulative vulnerabilities of the data sets.
Box plots showing the distribution of p-values of all VDMs across data
sets. Left (a) shows the impact of data sets on the VDMs' performance
in Firefox. Middle (b) reports the impact of the data sets shared between
Firefox and Chrome, and right (c) the impact of the data sets shared
between IE and Firefox.

Figure 5: The impact of data sets on the quality of VDMs.

NVD, NVD.Bug, and NVD.Nbug. For Firefox and IE, we can compare
NVD and NVD.Advice. For IE, Firefox, and Chrome together, NVD is
the only data set that supports all three browsers, so no
comparison is possible.

The effect of different data sets on VDMs is presented more
clearly in Figure 5. This figure reports the distribution of the
p-values of the $\chi^2$ test of all VDMs across data sets. The
leftmost box plot shows the differences among data sets when
fitting Firefox's vulnerabilities. The p-value spectrum of
NVD.Advice appears better than the others: in 50% of the cases
its p-value is greater than 0.4, whereas for the other data sets
50% of the p-values are less than 0.2. This means that the
chance of getting a good fit with NVD.Advice is greater than
with the other data sets. The same phenomenon appears in the
rightmost box plot, for Firefox and IE. Meanwhile, there seems
to be no big difference among the medians of NVD, NVD.Bug and
NVD.Nbug, as shown in the box plots for both Firefox and
Firefox & Chrome, although NVD looks better than NVD.Bug and
NVD.Nbug since its upper quartile is greater than theirs.

In summary, the analysis provides evidence that counting
vulnerabilities in different ways, which results in different
vulnerability data sets, affects the overall quality of VDMs.
When fitting the NVD.Advice data, VDMs have a better chance of
obtaining a Good Fit. This means that existing VDMs can better
model the trend of vulnerabilities that are confirmed by both
NVD and vendors' security advisories (NVD.Advice) than the
trends in the other data sets.

8. THE EVOLUTION OF VDM'S GOODNESS-OF-FIT IN DATA SETS

The observation in the previous section (§7) provides evidence
that existing VDMs can model the trend of vulnerabilities
reported in the NVD.Advice data set better than those in other
data sets at the time when the data sets were collected. For
better insight, this section analyzes whether this phenomenon
holds consistently over a long period or just happens by chance.
The result will also address the research question about
choosing the data set most appropriate for VDMs. The selection
criteria are not only that the data set can be well modeled by
VDMs (i.e., VDMs obtain more Good Fits) but, more importantly,
that the performance of VDMs is more stable on it than on other
data sets.

There are three states in the model: Fit (F), Inconclusive (I), and
NotFit (NF). As more data become available, a Fit model may remain Fit,
or become Inconclusive or NotFit. This evolution is represented by
transitions. The numbers labeling the transitions denote the
transitions' names.

Figure 6: The goodness-of-fit transition model.

The box plots illustrate the distribution of VDMs' goodness-of-fit
entropy $E_\beta$ in the different data sets (nvd, nvd.bug, nvd.advice,
nvd.Nbug, advice.Nbug). The calculation of entropy follows (2) with
$\beta = 1$.

Figure 7: Box plots of entropies ($\beta = 1$).

For this purpose we run the goodness-of-fit experiment over the
lifetime of the analyzed releases. For each release, we observe
the evolution of VDMs' goodness-of-fit with respect to the
evolution of the data set during the release's lifetime. The
lifetime of a software release is measured in months since
release (MSR). The first MSR is the end of the month after the
release date. The second MSR is the end of the month after the
first MSR, and so on. For example, IE v4.0 was released in
September 1997; hence its first MSR is 31 October 1997, and its
first observation is at the sixth MSR, 31 March 1998.
Observation begins at the sixth month after a release is
officially shipped to users and repeats monthly until the last
day on which data were collected. The cumulative numbers of
vulnerabilities at the observation points are fitted to all
VDMs. The experiment generated 14,817 curves in total.
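The MSR schedule described above can be sketched in Python as follows. This is only an illustration of the date arithmetic, not the tooling used in the experiment; the function names `msr_date` and `observation_dates` are hypothetical.

```python
import calendar
from datetime import date


def msr_date(release_year: int, release_month: int, k: int) -> date:
    """Return the k-th MSR: the last day of the k-th month after the release month."""
    # Count months on a single axis so that year rollovers are handled uniformly.
    months = release_year * 12 + (release_month - 1) + k
    year, month0 = divmod(months, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


def observation_dates(release_year: int, release_month: int, end: date, first_msr: int = 6):
    """List the monthly observation points from the sixth MSR up to `end` (inclusive)."""
    points = []
    k = first_msr
    while msr_date(release_year, release_month, k) <= end:
        points.append(msr_date(release_year, release_month, k))
        k += 1
    return points
```

For IE v4.0 (released September 1997), `msr_date(1997, 9, 1)` gives 31 October 1997 and `msr_date(1997, 9, 6)` gives 31 March 1998, matching the example in the text.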

Let us consider one VDM. When fitting data to this VDM, 
This figure shows the detailed evolution of the goodness-of-fit entropies of all data sets. The observation period for each type of data set depends on
the lifetime of the products supported by the data set. The solid lines indicate the grand median of entropies over the whole period. The dashed lines
and dotted lines show the medians of entropies for the first-half and second-half periods, respectively. The performance of VDMs is more stable if the median
of the second half is less than or equal to the median of the first half.

Figure 8: The evolution of goodness-of-fit entropies of all data sets.



we can get either Good Fit, Inconclusive, or Not Fit. Suppose
that at observation times $t$ and $t+1$ the goodness-of-fits
are $GoF_t$ and $GoF_{t+1}$, respectively. If $GoF_t$ equals
$GoF_{t+1}$, we say that the VDM is stable during the period
$(t..t+1)$; otherwise the VDM is unstable. We introduce a
measure of the stability of VDMs, called goodness-of-fit
entropy, obtained by counting the number of times the
goodness-of-fit of a VDM changes.

To formally define the goodness-of-fit entropy, we use the
goodness-of-fit transition model depicted in Figure 6. The model
consists of three states: Fit (F), Inconclusive (I), and Not Fit
(NF). The VDMs' goodness-of-fits are initially classified into
one of these states at the 6th MSR. The goodness-of-fit state
can subsequently evolve to other states according to the
transitions. The model has nine transitions in total, labeled
from 1 to 9 and denoted #i, each falling into one of three
categories: unchanged, small jump, and big jump. The unchanged
transitions (#1, #2, and #3) mean the state does not change; in
other words, there is no entropy. The small jump transitions
(#4-#7) denote a smaller change (compared to a big jump) of
p-value from Good Fit (> 0.95) to Inconclusive ([0.05..0.95)),
or from Inconclusive to Not Fit (< 0.05), and vice versa.
Meanwhile, the big jump transitions (#8, #9) denote a big change
from Good Fit to Not Fit and vice versa.
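As an illustration, the three states and the three transition categories can be sketched in Python as below. The names `gof_state`, `transition_kind` and `STATE_RANK` are hypothetical, and treating a p-value of exactly 0.95 as Fit is an assumption drawn from the half-open Inconclusive interval [0.05..0.95).

```python
# Ordering of the states from best to worst fit; names follow the transition model.
STATE_RANK = {"F": 0, "I": 1, "NF": 2}


def gof_state(p_value: float) -> str:
    """Map a goodness-of-fit p-value to Fit (F), Inconclusive (I) or Not Fit (NF)."""
    if p_value >= 0.95:  # assumption: exactly 0.95 counts as Fit
        return "F"
    if p_value >= 0.05:  # Inconclusive covers [0.05, 0.95)
        return "I"
    return "NF"


def transition_kind(prev_state: str, next_state: str) -> str:
    """Classify a transition by how many states it moves across."""
    distance = abs(STATE_RANK[prev_state] - STATE_RANK[next_state])
    return ("unchanged", "smalljump", "bigjump")[distance]
```

For example, a VDM whose p-value moves from 0.97 to 0.50 between two observations makes a small jump (F to I), while a move from 0.97 to 0.01 is a big jump (F to NF).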

The goodness-of-fit entropy at observation time $t$ is
estimated by counting the transitions that occur when moving
from time $t-1$ to time $t$. Since the levels of instability of
the transitions are not equal, different kinds of transition may
contribute differently to the overall entropy. Unchanged
transitions do not contribute to the entropy; for the others, we
define $\beta$ as the factor by which a big jump is more chaotic
than a small jump, i.e., a big jump counts $\beta$ times as much
as a small jump. The goodness-of-fit entropy is calculated as
follows:



\[
E_\beta(t) = \frac{|smalljump|_t + \beta \cdot |bigjump|_t}
                  {|unchanged|_t + |smalljump|_t + \beta \cdot |bigjump|_t}
\tag{2}
\]

where $|XXjump|_t$ is the number of XXjump transitions (small or
big) when moving from time $t-1$ to time $t$, and
$|unchanged|_t$ is the number of unchanged transitions.
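Equation (2) can be sketched directly from the three transition counts. The function name `gof_entropy` and its argument names are hypothetical; rejecting the case with no transitions at all is a choice of this sketch, since the text does not discuss it.

```python
def gof_entropy(unchanged: int, small_jump: int, big_jump: int, beta: float = 1.0) -> float:
    """Goodness-of-fit entropy for one step t-1 -> t, following Equation (2).

    `beta` weights a big jump relative to a small jump.
    """
    weighted_changes = small_jump + beta * big_jump
    total = unchanged + weighted_changes
    if total == 0:
        # No transitions were observed, so the ratio is undefined.
        raise ValueError("at least one transition is required")
    return weighted_changes / total
```

With 8 unchanged transitions, 1 small jump and 1 big jump, the entropy is 2/10 = 0.2 for β = 1 and 3/11 for β = 2.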

The goodness-of-fit entropy defined in (2) ranges from 0 to 1.
Entropy equals 0 when $|smalljump|_t + |bigjump|_t = 0$, which
denotes local stability of the goodness-of-fit when moving from
time $t-1$ to time $t$. On the contrary, entropy equals 1 when
$|unchanged|_t = 0$, which is complete chaos.

The box plots in Figure 7 report the distribution of
goodness-of-fit entropies over the observation period. For most
data sets, the entropy is less than 0.1 in about 75% of the
cases. These small values provide evidence that VDMs'
goodness-of-fits are fairly stable within these data sets.
Moreover, the overlapping notches among NVD, NVD.Advice,
NVD.Nbug and Advice.Nbug hint that the medians of entropy of
these four data sets are not statistically different. Meanwhile,
the median of NVD.Bug is significantly greater than the others.
This observation is confirmed by the Kruskal-Wallis rank sum
test on the entropies. The null hypothesis is "there is no
difference between the medians of entropies among data sets".
The Kruskal-Wallis test on the four data sets NVD, NVD.Advice,
NVD.Nbug and Advice.Nbug yields p-value = 0.271, which means we
do not have enough evidence to conclude that their medians
differ. The Kruskal-Wallis test on all five data sets yields
p-value = 0.0040, which confirms that NVD.Bug differs
significantly from the other data sets.



To understand how entropies evolve, we divide the observation
period of each data set into two parts, a first half and a
second half, and calculate the median entropy of each part. In a
stable evolution, the entropy median of the first half is
greater than or equal to the entropy median of the second half:
a decreasing entropy median means that the VDMs become more
stable as more data become available in the data set.
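The first-half versus second-half comparison can be sketched as follows. The names `half_medians` and `is_stable_evolution` are hypothetical, and assigning the middle observation of an odd-length series to the second half is an assumption of this sketch.

```python
from statistics import median


def half_medians(entropies):
    """Split a time-ordered entropy series in two and return both medians."""
    if len(entropies) < 2:
        raise ValueError("need at least two observations")
    mid = len(entropies) // 2  # assumption: an odd middle point goes to the second half
    return median(entropies[:mid]), median(entropies[mid:])


def is_stable_evolution(entropies) -> bool:
    """True if the second-half median does not exceed the first-half median."""
    first_median, second_median = half_medians(entropies)
    return second_median <= first_median
```

A series such as `[0.2, 0.1, 0.1, 0.0]` counts as a stable evolution, whereas `[0.0, 0.0, 0.1, 0.3]` does not.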

Figure 8 shows the evolution of goodness-of-fit entropy for each
data set. The solid lines indicate the grand medians over the
whole observation periods. The dashed lines denote the medians
of the first-half periods, and the dotted lines the medians of
the second-half periods. Notice that the grand medians in these
plots are exactly the medians of the data sets illustrated in
Figure 7, since Figure 7 is a summary view of Figure 8.

Looking at the trends of evolution, the entropy variation of
NVD.Advice seems smaller than that of the other data sets.
Moreover, the second-half median of NVD.Advice is less than its
first-half median. So NVD.Advice seems to be a good candidate
data set. NVD.Nbug also has a second-half median below its
first-half median, but its entropy variation in the second half
looks bigger than that of NVD.Advice. At the other extreme,
NVD.Bug performs poorly. Its entropy plot fluctuates strongly,
especially in the second half, and its entropy median increases
in the second-half period, which might imply more unstable
performance of VDMs.

To confirm that NVD.Bug is the worst one, we additionally employ
the one-sided Mann-Whitney U test to perform pairwise tests
between the entropies of NVD.Bug and those of each other data
set, with the alternative hypothesis "the entropy distribution
of NVD.Bug is stochastically larger than the other's". Since
multiple comparisons are made, the Bonferroni correction is
applied with the number of tests n = 4, so the significance
level is $\alpha' = 0.05/4 = 0.0125$. The tests show that the
entropy variation of NVD.Bug is larger than that of NVD
(p-value = 1.82-10"*), NVD.Advice (p-value = 1.3-10"^), and
Advice.Nbug (p-value = 0.004). For the comparison between
NVD.Bug and NVD.Nbug, the p-value = 0.06 > $\alpha'$. This is
not enough evidence to conclude that the entropy variation of
NVD.Bug is larger than that of NVD.Nbug, although the result
points in that direction.
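The Bonferroni decision rule used here amounts to simple arithmetic, sketched below. The function names are hypothetical, and only the two p-values that are fully legible in the text (0.004 and 0.06) are used in the example.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def is_significant(p_value: float, alpha: float = 0.05, n_tests: int = 4) -> bool:
    """True if a single pairwise test rejects the null at the corrected level."""
    return p_value < bonferroni_alpha(alpha, n_tests)
```

With α = 0.05 and four tests the corrected level is 0.0125, so the comparison with p-value = 0.004 is significant while the one with p-value = 0.06 is not.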

In summary, all five analyzed data sets achieve good overall
stability of VDMs' goodness-of-fit performance. The entropy of
VDMs' performance is less than 0.2 (0 for perfect stability, 1
for completely dynamic) for all data sets. Among the data sets,
NVD.Bug is the worst: it is significantly more unstable than the
other data sets and, more importantly, it tends to become more
unstable as more data become available (i.e., the entropy of the
second-half period is greater than that of the first-half
period). On the other hand, NVD.Advice is slightly better than
the others. Even though there is no significant difference among
the medians, NVD.Advice appears to be the most appropriate data
set for VDMs because VDMs' goodness-of-fits on this data set
become more stable as more data become available.

9. THE TEMPORAL QUALITY OF VDM 

This section addresses research question RQ4. To learn which VDM
is globally better than the others, we analyze the performance
of VDMs over the lifetime of the analyzed releases. We introduce
another measure of the performance of a VDM, namely
goodness-of-fit quality $Q$ (or quality for short).

The quality of a VDM depends on how well it can fit the
vulnerability data of the analyzed releases. This quality
measure can vary over time since the vulnerability data evolve
over time, as seen in the previous section (§8). The VDM's
quality at time $t$ is measured by the ratio of the number of
Good Fits (p-value of the $\chi^2$ goodness-of-fit test > 0.95)
to the total number of fits at time $t$. Besides, since we
cannot draw a conclusion about an Inconclusive Fit, whose
p-value ranges from 0.05 to 0.95, an Inconclusive Fit also
contributes to the overall quality, though possibly not as much
as a Good Fit. Thus, we use an extra factor $\omega$ to denote
that a Good Fit is $\omega$ times as good as an Inconclusive
Fit. The formula is defined as follows.



\[
Q_\omega(t) = \frac{|Fit|_t + \frac{1}{\omega} \cdot |Inconclusive|_t}
                   {|Fit|_t + |Inconclusive|_t + |NotFit|_t}
\tag{3}
\]
where $|X|_t$ is the number of times a VDM obtains
goodness-of-fit X at time $t$ (X is Fit, Inconclusive or Not
Fit). $Q_\omega(t)$ ranges from 0 to 1. $Q_\omega(t) = 0$
indicates that a VDM does not fit any data, and
$Q_\omega(t) = 1$ shows that a VDM fits all data very well.
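As a sketch, the quality measure (3) can be computed from the three counts as below. It assumes that the weight of an Inconclusive Fit is 1/ω, which is how we read "a Good Fit is ω times as good as an Inconclusive Fit"; the function name `gof_quality` and the guard on ω are hypothetical.

```python
def gof_quality(fit: int, inconclusive: int, not_fit: int, omega: float = 1.0) -> float:
    """Goodness-of-fit quality of a VDM at one observation time.

    Assumption: an Inconclusive Fit is weighted by 1/omega relative to a Good Fit.
    """
    if omega < 1:
        # omega < 1 would rate Inconclusive above Good Fit and could push Q above 1.
        raise ValueError("omega is expected to be at least 1")
    total = fit + inconclusive + not_fit
    if total == 0:
        raise ValueError("at least one fit is required")
    return (fit + inconclusive / omega) / total
```

For 3 Good Fits, 4 Inconclusive Fits and 3 Not Fits, the quality is 0.7 with ω = 1 and 0.5 with ω = 2.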

Figure 9 shows the notched box plots of the global
goodness-of-fit quality of VDMs. The top plots show VDMs'
quality regardless of data set. The bottom plots are similar but
restricted to the NVD.Advice data set. To additionally evaluate
how the difference between a Good Fit and an Inconclusive Fit
(the $\omega$ factor) affects the final quality, the left plots
show VDMs' quality where a Good Fit is as good as an
Inconclusive Fit ($\omega = 1$), and the right plots show VDMs'
quality where a Good Fit is twice as good as an Inconclusive Fit
($\omega = 2$).

If we ignore the data sets (top plots), in roughly 75% of the
cases the AML model has better quality than the other models.
Meanwhile, the quality of the AT model is the worst. This is
true regardless of the value of $\omega$. For the other models,
the plots show that there is not much difference among the LN,
LP and RE models. The RQ model is slightly worse than these
since the first quartile of its distribution is much lower than
theirs.

The previous section showed that NVD.Advice is slightly better
than the other data sets. So, in the bottom plots, we analyze
the quality of VDMs on NVD.Advice. Here we observe the same
phenomenon for both the AML and AT models. Among the other
models, LP and RE look alike, while the LN and RQ models are
slightly better.

This result is largely consistent with the earlier analysis on
the NVD data set at the time the data were collected (see §6).
This allows us to draw stronger conclusions about the
performance of the analyzed VDMs.

First, the AT model is clearly not the right model for the
vulnerability discovery process. This means that the rationale
behind the AT model is not applicable to vulnerability
detection. Besides, the two models LP and RE also do not obtain
good quality compared to other models, especially the AML model.
If we consider an Inconclusive Fit to be half as good as a Good
Fit and use NVD.Advice, the qualities of these two models are
even lower relative to the other models. Thus, the AT, LP and RE
models are not good options for vulnerability detection. These
three models have in common that they are all based on
reliability models used to express the discovery process of
normal bugs. The core difference between reliability models and
VDMs is
The top charts show the quality of VDMs on all data sets, and the
bottom charts show the quality of VDMs on the NVD.Advice data set.
In the left charts, a Good Fit is as good as an Inconclusive Fit ($\omega = 1$).
In the right charts, a Good Fit is twice as good as an Inconclusive
Fit ($\omega = 2$).

Figure 9: The VDMs' goodness-of-fit quality.



the motivation of detectors. In the former, detectors are
software engineers (testers, quality control), so they invest in
finding bugs only until the reliability of an application
reaches a certain threshold. In the latter, detectors are the
whole community of people interested in the application. The
motivation for finding a vulnerability is therefore much greater
than for finding a normal bug, and it also lasts longer
(depending on the number of users of the application). Existing
reliability-based VDMs, which do not capture this phenomenon,
cannot obtain good performance.

Notably, the three models AT, RE and LP are all based on
reliability models, but RE and LP obtain better performance than
the AT model. This is due to the shape of each model. The shape
of the AT model does not express well the first period of
vulnerability discovery, when vulnerabilities are found in an
(approximately) linear manner. RE and LP, in contrast, capture
this better. Hence, the AT model fails in most cases, while the
RE and LP models still obtain good results when the evolution of
vulnerabilities is linear.

Second, the AML model is better than the other models since its
assumptions better match the actual behavior of the community in
finding vulnerabilities in software.

10. THREATS TO VALIDITY 

Bias in data collection. This work employs the same technique
discussed in [13] to parse the HTML pages of MFSA and to process
the XML data of NVD and Bugzilla. Even though the collector tool
has been checked multiple times, it might contain bugs that
affect data collection.

Bias in bug-to-nvd linking scheme. While collecting data for
Advice.Nbug, we apply some rules to link a bug to an NVD entry
based on their locations in the MFSA report. Nevertheless, these
links might be incorrect. We manually checked some links for a
relevant connection between bug reports and NVD entries, and
they were found to be consistent. However, again, this might not
always hold.

Overestimation of number of bugs in each version. We do not know
exactly which versions a bug affects. Consequently, we assume
that a bug affects all versions mentioned in the linked NVD
entry. This might overestimate the number of vulnerabilities in
each version. To mitigate the problem, we calculate the latest
release that a bug might affect and filter out all vulnerable
releases after it. This calculation uses the bug-fix mining
technique discussed in [19].

Error in curve fitting. We estimate the goodness-of-fit of VDMs
using the nonlinear least-squares technique implemented in R
(the nls() function). This might not produce the optimal
solution, which affects the validity of this work. To mitigate
this issue, we additionally employ a commercial tool,
CurveExpert Pro, to cross-check the goodness-of-fit.

Bias in statistical tests. Our conclusions are based on
statistical tests, which have their own assumptions. Choosing
tests whose assumptions are violated might lead to wrong
conclusions. To reduce this risk we carefully analyzed the
assumptions of the tests; for instance, we did not apply any
test with a normality assumption since the distribution of
vulnerabilities is not normal.

11. CONCLUSION 

In this work we addressed a fundamental question in
vulnerability discovery modeling: "do existing VDMs work?". We
conducted an experiment in which we fitted six existing VDMs
(i.e., AML, AT, LN, LP, RE and RQ) to fifty-eight data sets of
seventeen releases of three popular web browsers: IE, Firefox
and Chrome.

This experiment confirmed that the assumption behind the AML
model, namely that the vulnerability discovery process follows
three phases (learning, linear, and saturation), is more
appropriate for the observed data. This idea apparently captures
the way people discover vulnerabilities in practice. However, in
the case of Firefox, a large portion of the old code base is
inherited by modern releases [14]; therefore the vulnerabilities
of the very first releases (e.g., v1.0, v1.5) continue to
increase after a period of saturation, even though these
releases are retired (out of support). This explains the Not Fit
results of all VDMs on these releases, since none of them is
able to capture this phenomenon.

In contrast, the performance of the AT model is very poor. We
can conclude that the assumption of this model is not applicable
to vulnerability detection. We speculate that people are more
passionate about finding vulnerabilities than normal bugs,
whereas software testers focus on finding bugs only until the
reliability level of the software reaches a certain threshold.
This also explains why AML is slightly better than the
reliability-based models (i.e., RE, LP). The performances of LN,
RE and LP are similar because the vulnerabilities of many of the
analyzed releases are more or less in the linear phase.

The investigation of the evolution of goodness-of-fit entropy
and quality reveals a notable impact of data set selection on
the quality of a VDM, even though there is no statistically
significant difference in goodness-of-fit entropy among data
sets. The NVD.Advice data set emerges as the best one in terms
of entropy (albeit slightly) and quality.

However, this experiment is based on only one kind of
application, which might limit the generality of the result.
Therefore, as part of future work, similar experiments on other
kinds of applications, e.g., operating systems and web server
applications, should be conducted in order to reach solid
conclusions.